My dataset is called “diamonds.csv”. It is adopted from https://www.kaggle.com/shivam2503/diamonds. This is a classical dataset that contains data on about 53940 diamonds with 10 different variables, with one of them being price.
The main goal of this project is to use the data to ultimately devise a model (albeit a best-fit line or model) that can be used to determine the price of a diamond based on the other 9 variables.
I also want to conjure up other analytical data, such as correlation, between other variables beside the price. For example, I will show the correlation between clarity and carat of a diamond, to show how a high number of one variable may correlate to a high number of the other variable. Ultimately, I hope to use this project to showcase all my R and data analytical skills I learned in Math 32.
With the code below, can see the first couple of values with the head() function and the columns. With the summary() function I get satistics about each column in the data set.
What each column is: 1. Carat = Carat weight of the diamond 2. Cut = Cut quality of the diamond 3. Color = Color of the diamond. D being the best and J as the worst 4. Depth = Depth percentage: The height of a diamond, measured from the culet to the table, divided by its average girdle diameter 5. Table = table percentage: The width of the diamond’s table expressed as a percentage of its average diameter 6. Price = Price of the diamond 7. X = Length mm 8. Y = Width mm 9. Z = Depth mm
diamonds = read.csv(file="diamonds.csv", header=TRUE, sep = ",")
head(diamonds)
## X carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
The summary() command produces an output similar to the table() function on those columns that are not numeric. The ones that are calculates the Minimum, 1st Quartile, Median, Mean, 3rd Quartile, and Maximum.
summary(diamonds)
## X carat cut color clarity
## Min. : 1 Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:13486 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :26971 Median :0.7000 Ideal :21551 F: 9542 SI2 : 9194
## Mean :26971 Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:40455 3rd Qu.:1.0400 Very Good:12082 H: 8304 VVS2 : 5066
## Max. :53940 Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
Now it is necessary to see if all the numeric data is actually numeric. This is done in the below chunk by running the command is.numeric() on each of the columns that are supposed to contain all the numbers.
carat <- sapply(diamonds$carat, is.numeric)
table(carat) #returned all 53940 are TRUE
#Because all are numeric, we must turn all values to numbers
diamonds$carat <- sapply(diamonds$carat, as.numeric)
head(diamonds$carat)
depth <- sapply(diamonds$depth, is.numeric)
table(depth) #returned all 53940 are TRUE
#Because all are numeric, we must turn all values to numbers
diamonds$depth <- sapply(diamonds$depth, as.numeric)
head(diamonds$depth)
tab <- sapply(diamonds$table, is.numeric)
table(tab) #returned all 53940 are TRUE
#Because all are numeric, we must turn all values to numbers
diamonds$tab <- sapply(diamonds$table, as.numeric)
head(diamonds$tab)
price <- sapply(diamonds$price, is.numeric)
table(price) #returned all 53940 are TRUE
#Because all are numeric, we must turn all values to numbers
diamonds$price <- sapply(diamonds$price, as.numeric)
head(diamonds$price)
x <- sapply(diamonds$x, is.numeric)
table(x) #returned all 53940 are TRUE
#Because all are numeric, we must turn all values to numbers
diamonds$x <- sapply(diamonds$x, as.numeric)
head(diamonds$x)
y <- sapply(diamonds$y, is.numeric)
table(y) #returned all 53940 are TRUE
#Because all are numeric, we must turn all values to numbers
diamonds$y <- sapply(diamonds$y, as.numeric)
head(diamonds$y)
z <- sapply(diamonds$z, is.numeric)
table(z) #returned all 53940 are TRUE
#Because all are numeric, we must turn all values to numbers
diamonds$z <- sapply(diamonds$z, as.numeric)
head(diamonds$z)
We have three variables X, Y, Z. Looking at them individually is a solution but what would be more helpful if they were looked as one variable, a combination of X, Y, and Z, which would be XYZ, also known as the volume.
diamonds$volume = diamonds$x*diamonds$y*diamonds$z
head(diamonds)
## X carat cut color clarity depth table price x y z tab volume
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 55 38.20203
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 61 34.50586
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 65 38.07688
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 58 46.72458
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 58 51.91725
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 57 38.69395
Here I simply plotted each of the main variables including volume and excluding x, y, and z, against the price column. The purpose of this is to get an overview of the data and determine any easily-recognized patterns. Due to an overlap in points on the scatter plot it is hard to determine whether a blot of points represents 10 points or 10000 points, thus a 3d representation of the density of the plot is also shown after each corresponding graph.
plot(diamonds$carat, diamonds$price, main = "Carat Related to Price", xlab="Carat",ylab="Price")
kd <- with(diamonds, MASS::kde2d(diamonds$carat, diamonds$price, n = 50))
plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface()
plot(diamonds$depth, diamonds$price, main = "Depth Related to Price", xlab="Depth",ylab="Price")
kd <- with(diamonds, MASS::kde2d(diamonds$depth, diamonds$price, n = 50))
plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface()
plot(diamonds$table, diamonds$price, main = "Table Related to Price", xlab="Table",ylab="Price")
kd <- with(diamonds, MASS::kde2d(diamonds$table, diamonds$price, n = 50))
plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface()
plot(diamonds$volume, diamonds$price, main = "Volume Related to Price", xlab="Volume",ylab="Price")
kd <- with(diamonds, MASS::kde2d(diamonds$volume, diamonds$price, n = 50))
plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface()
These following graphs are the same as above, except now they will correct for outliers that hinder the overview of the graphs. This is done by using a box and wisker plot to see where the outliers exist, then comes the process of removing the outliers. When we have identified the outliers, then the entire row is excluded from the data frame.
library(gridExtra)
outliers <- boxplot(diamonds$carat, main="Carat Weight of the Diamonds")$out
diamonds <- diamonds[-which(diamonds$carat %in% outliers),]
boxplot(diamonds$carat, main="Carat Weight After Outliers removed")
outliers <- boxplot(diamonds$depth, main="Depth of the Diamonds")$out
diamonds <- diamonds[-which(diamonds$depth %in% outliers),]
boxplot(diamonds$depth, main="Depth After Outliers removed")
outliers <- boxplot(diamonds$table, main="Table of the Diamonds")$out
diamonds <- diamonds[-which(diamonds$table %in% outliers),]
boxplot(diamonds$table, main="Table After Outliers removed")
outliers <- boxplot(diamonds$volume, main="Volume of the Diamonds")$out
diamonds <- diamonds[-which(diamonds$volume %in% outliers),]
boxplot(diamonds$volume, main="Volume After Outliers removed")
Now plotting against price will show any obvious relations/correlation between the variables. Then, a type of regression can be assigned. Note that these plots also come with corresponding 3d density plots same as before.
plot(diamonds$carat, diamonds$price, main = "Carat Related to Price REVISED", xlab="Carat",ylab="Price")
kd <- with(diamonds, MASS::kde2d(diamonds$carat, diamonds$price, n = 50))
plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface()
plot(diamonds$depth, diamonds$price, main = "Depth Related to Price REVISED", xlab="Depth",ylab="Price")
kd <- with(diamonds, MASS::kde2d(diamonds$depth, diamonds$price, n = 50))
plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface()
plot(diamonds$table, diamonds$price, main = "Table Related to Price REVISED", xlab="Table",ylab="Price")
kd <- with(diamonds, MASS::kde2d(diamonds$table, diamonds$price, n = 50))
plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface()
plot(diamonds$volume, diamonds$price, main = "Volume Related to Price REVISED", xlab="Volume",ylab="Price")
kd <- with(diamonds, MASS::kde2d(diamonds$volume, diamonds$price, n = 50))
plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface()
Now since the dataset has been primed, conducting simple linear regressions and determining relative correlations on the data set can help us get close to the main goal of this project.
The below r script will compute and print out the summary as well as the graph of each linear model formed by all the combinations of the four variables: carat, depth, table, and volume, with price
#CARAT vs PRICE
caratlm <- lm(diamonds$price ~ diamonds$carat)
summary(caratlm)
##
## Call:
## lm(formula = diamonds$price ~ diamonds$carat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7850.7 -766.0 -20.8 503.2 12703.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2226.17 13.59 -163.8 <2e-16 ***
## diamonds$carat 7754.80 16.24 477.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1425 on 48802 degrees of freedom
## Multiple R-squared: 0.8238, Adjusted R-squared: 0.8238
## F-statistic: 2.281e+05 on 1 and 48802 DF, p-value: < 2.2e-16
plot(diamonds$price ~ diamonds$carat, main = "Carat vs Price Linear Regression Model", xlab="Carat",ylab="Price")
abline(caratlm, col="red")
cat("EQUATION of the Line: y=", caratlm$coefficients[1], "+", caratlm$coefficients[2], "x")
## EQUATION of the Line: y= -2226.166 + 7754.804 x
#DEPTH vs PRICE
depthlm <- lm(diamonds$price ~ diamonds$depth)
summary(depthlm)
##
## Call:
## lm(formula = diamonds$price ~ diamonds$depth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3162 -2581 -1286 1452 15332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3521.4175 880.5532 3.999 6.37e-05 ***
## diamonds$depth -0.5524 14.2462 -0.039 0.969
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3394 on 48802 degrees of freedom
## Multiple R-squared: 3.081e-08, Adjusted R-squared: -2.046e-05
## F-statistic: 0.001504 on 1 and 48802 DF, p-value: 0.9691
plot(diamonds$price ~ diamonds$depth, main = "Depth vs Price Linear Regression Model", xlab="Depth",ylab="Price")
abline(depthlm, col="red")
cat("EQUATION of the Line: y=", depthlm$coefficients[1], "+", depthlm$coefficients[2], "x")
## EQUATION of the Line: y= 3521.418 + -0.5524095 x
#TABLE vs PRICE
tablelm <- lm(diamonds$price ~ diamonds$table)
summary(tablelm)
##
## Call:
## lm(formula = diamonds$price ~ diamonds$table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4311 -2398 -1252 1384 15781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8271.706 434.541 -19.04 <2e-16 ***
## diamonds$table 205.389 7.585 27.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3369 on 48802 degrees of freedom
## Multiple R-squared: 0.0148, Adjusted R-squared: 0.01478
## F-statistic: 733.2 on 1 and 48802 DF, p-value: < 2.2e-16
plot(diamonds$price ~ diamonds$table, main = "Table vs Price Linear Regression Model", xlab="Table",ylab="Price")
abline(tablelm, col="red")
cat("EQUATION of the Line: y=", tablelm$coefficients[1], "+", tablelm$coefficients[2], "x")
## EQUATION of the Line: y= -8271.706 + 205.3894 x
#VOLUME vs PRICE
volumelm <- lm(diamonds$price ~ diamonds$volume)
summary(volumelm)
##
## Call:
## lm(formula = diamonds$price ~ diamonds$volume)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7931.0 -764.1 -25.2 501.6 17962.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2276.2776 13.6316 -167.0 <2e-16 ***
## diamonds$volume 48.0314 0.1002 479.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1420 on 48802 degrees of freedom
## Multiple R-squared: 0.8249, Adjusted R-squared: 0.8249
## F-statistic: 2.299e+05 on 1 and 48802 DF, p-value: < 2.2e-16
plot(diamonds$price ~ diamonds$volume, main = "Volume vs Price Linear Regression Model", xlab="Volume",ylab="Price")
abline(volumelm, col="red")
cat("EQUATION of the Line: y=", volumelm$coefficients[1], "+", volumelm$coefficients[2], "x")
## EQUATION of the Line: y= -2276.278 + 48.0314 x
As it is clear that the table and the depth variables seem to have a random spread of data along with the really high errors and very small r^2 values and that these variables do not contribute much to the price of the diamonds. So, these variables would be dropped and their linear regression models are not going to be used.
Now that only two NUMERICAL variables (carat and volume) are going to be used, it is possible to multiply these two variables together and be plotted along with their combined linear regression model.
cVmodel <- lm(diamonds$price ~ diamonds$carat * diamonds$volume)
x <- diamonds$carat * diamonds$volume
summary(cVmodel)
##
## Call:
## lm(formula = diamonds$price ~ diamonds$carat * diamonds$volume)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10676.4 -472.9 -45.4 246.0 15613.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -581.9286 25.8246 -22.534 <2e-16 ***
## diamonds$carat 545.3612 243.0185 2.244 0.0248 *
## diamonds$volume 14.0462 1.5139 9.278 <2e-16 ***
## diamonds$carat:diamonds$volume 17.4083 0.2312 75.287 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1343 on 48800 degrees of freedom
## Multiple R-squared: 0.8434, Adjusted R-squared: 0.8434
## F-statistic: 8.764e+04 on 3 and 48800 DF, p-value: < 2.2e-16
plot(diamonds$price ~ x, main = "Carat and Volume VS Price Linear Regression Model", xlab="Carat and Volume",ylab="Price")
abline(lm(diamonds$price ~ x), col="red") # just for visual purposes
cat("EQUATION of the Line: y =", cVmodel$coefficients[1], "+", cVmodel$coefficients[2], "c +", cVmodel$coefficients[3], "v +", cVmodel$coefficients[4], "cv", "\nwhere c = carat and v = volume")
## EQUATION of the Line: y = -581.9286 + 545.3612 c + 14.04618 v + 17.40833 cv
## where c = carat and v = volume
The highest R^2 achieved in the previous plots was 0.8249 but with this model, we achieved a higher R^2 as 0.8434. This is some progress but including more variables would help with the model and increase the R^2 value.
Of course, the data variables that are non-numerical should not be ruled out as one that does not make a big different in price. What the below R code does is calcuate the average of each non-numerical value against price and plots it in a bar graph
means_cut <- sort(with(diamonds, tapply(price, cut, mean)))
x<-barplot(means_cut,main="Mean Values of the Price with each Cut Variation", xlab="Cut Quality", ylab="Mean Prices")
text(x,means_cut-210,labels=as.character(floor(means_cut)))
means_color <- sort(with(diamonds, tapply(price, color, mean)))
x<-barplot(means_color,main="Mean Values of the Price with each Color Variation", xlab="Color", ylab="Mean Prices")
text(x,means_color-210,labels=as.character(floor(means_color)))
means_clarity <- sort(with(diamonds, tapply(price, clarity, mean)))
x<-barplot(means_clarity,main="Mean Values of the Price with each Clarity Variation", xlab="Clarity", ylab="Mean Prices")
text(x,means_clarity-210,labels=as.character(floor(means_clarity)))
Since we see a clear increase in the average price for each different value in the non-numerical value.
Since we know that there is an effect of these variables on price, the ratio of each value against the lowest one can be caluculated and inputed within the current linear model of carat and volume. These ratios also be inputed in new columns in the diamonds data frame.
means_cut <- means_cut/means_cut[1]
means_color <- means_color/means_color[1]
means_clarity <- means_clarity/means_clarity[1]
diamonds$means_cut = means_cut[diamonds$cut]
diamonds$means_color = means_color[diamonds$color]
diamonds$means_clarity = means_clarity[diamonds$clarity]
head(diamonds)
## X carat cut color clarity depth table price x y z tab volume
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 55 38.20203
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 61 34.50586
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 58 46.72458
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 58 51.91725
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 57 38.69395
## 7 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 57 38.83087
## means_cut means_color means_clarity
## 1 1.129814 1.037607 1.304503
## 2 1.208304 1.037607 1.302008
## 4 1.208304 1.305763 1.450031
## 5 1.099485 1.341794 1.304503
## 6 1.241950 1.341794 1.528706
## 7 1.241950 1.305763 1.468815
Now that there are ratios for the non-numerical variables, it is possible to add it to our linear model of carat and volume to form a stronger linear model.
finalModel <- lm(diamonds$price ~ diamonds$carat * diamonds$volume * diamonds$means_cut * diamonds$means_color * diamonds$means_clarity)
x <- diamonds$carat * diamonds$volume * diamonds$means_cut * diamonds$means_color * diamonds$means_clarity
summary(finalModel)
##
## Call:
## lm(formula = diamonds$price ~ diamonds$carat * diamonds$volume *
## diamonds$means_cut * diamonds$means_color * diamonds$means_clarity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8833.4 -346.2 -46.6 202.6 14164.6
##
## Coefficients:
## Estimate
## (Intercept) 281951.9
## diamonds$carat 7457334.8
## diamonds$volume -52117.1
## diamonds$means_cut -277805.0
## diamonds$means_color -194137.0
## diamonds$means_clarity -202416.9
## diamonds$carat:diamonds$volume 3442.0
## diamonds$carat:diamonds$means_cut -6513373.0
## diamonds$volume:diamonds$means_cut 46169.8
## diamonds$carat:diamonds$means_color -5753615.5
## diamonds$volume:diamonds$means_color 39908.3
## diamonds$means_cut:diamonds$means_color 195966.9
## diamonds$carat:diamonds$means_clarity -5697295.9
## diamonds$volume:diamonds$means_clarity 39525.7
## diamonds$means_cut:diamonds$means_clarity 200113.6
## diamonds$means_color:diamonds$means_clarity 138181.4
## diamonds$carat:diamonds$volume:diamonds$means_cut -3625.1
## diamonds$carat:diamonds$volume:diamonds$means_color -2524.9
## diamonds$carat:diamonds$means_cut:diamonds$means_color 5023956.4
## diamonds$volume:diamonds$means_cut:diamonds$means_color -35372.2
## diamonds$carat:diamonds$volume:diamonds$means_clarity -2404.7
## diamonds$carat:diamonds$means_cut:diamonds$means_clarity 4963235.0
## diamonds$volume:diamonds$means_cut:diamonds$means_clarity -34932.5
## diamonds$carat:diamonds$means_color:diamonds$means_clarity 4379166.7
## diamonds$volume:diamonds$means_color:diamonds$means_clarity -30150.4
## diamonds$means_cut:diamonds$means_color:diamonds$means_clarity -140397.9
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color 2677.8
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_clarity 2609.1
## diamonds$carat:diamonds$volume:diamonds$means_color:diamonds$means_clarity 1766.8
## diamonds$carat:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity -3813189.0
## diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity 26662.9
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity -1922.9
## Std. Error
## (Intercept) 64122.6
## diamonds$carat 842021.9
## diamonds$volume 5290.6
## diamonds$means_cut 54680.6
## diamonds$means_color 53017.2
## diamonds$means_clarity 46341.9
## diamonds$carat:diamonds$volume 627.8
## diamonds$carat:diamonds$means_cut 710570.9
## diamonds$volume:diamonds$means_cut 4467.9
## diamonds$carat:diamonds$means_color 663874.2
## diamonds$volume:diamonds$means_color 4177.1
## diamonds$means_cut:diamonds$means_color 45241.3
## diamonds$carat:diamonds$means_clarity 625248.1
## diamonds$volume:diamonds$means_clarity 3921.1
## diamonds$means_cut:diamonds$means_clarity 39505.3
## diamonds$means_color:diamonds$means_clarity 38301.9
## diamonds$carat:diamonds$volume:diamonds$means_cut 535.0
## diamonds$carat:diamonds$volume:diamonds$means_color 508.4
## diamonds$carat:diamonds$means_cut:diamonds$means_color 560576.4
## diamonds$volume:diamonds$means_cut:diamonds$means_color 3529.6
## diamonds$carat:diamonds$volume:diamonds$means_clarity 463.8
## diamonds$carat:diamonds$means_cut:diamonds$means_clarity 526894.2
## diamonds$volume:diamonds$means_cut:diamonds$means_clarity 3306.9
## diamonds$carat:diamonds$means_color:diamonds$means_clarity 493054.7
## diamonds$volume:diamonds$means_color:diamonds$means_clarity 3096.3
## diamonds$means_cut:diamonds$means_color:diamonds$means_clarity 32673.1
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color 433.3
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_clarity 395.2
## diamonds$carat:diamonds$volume:diamonds$means_color:diamonds$means_clarity 375.2
## diamonds$carat:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity 415757.2
## diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity 2612.9
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity 319.7
## t value
## (Intercept) 4.397
## diamonds$carat 8.856
## diamonds$volume -9.851
## diamonds$means_cut -5.081
## diamonds$means_color -3.662
## diamonds$means_clarity -4.368
## diamonds$carat:diamonds$volume 5.483
## diamonds$carat:diamonds$means_cut -9.166
## diamonds$volume:diamonds$means_cut 10.334
## diamonds$carat:diamonds$means_color -8.667
## diamonds$volume:diamonds$means_color 9.554
## diamonds$means_cut:diamonds$means_color 4.332
## diamonds$carat:diamonds$means_clarity -9.112
## diamonds$volume:diamonds$means_clarity 10.080
## diamonds$means_cut:diamonds$means_clarity 5.065
## diamonds$means_color:diamonds$means_clarity 3.608
## diamonds$carat:diamonds$volume:diamonds$means_cut -6.776
## diamonds$carat:diamonds$volume:diamonds$means_color -4.967
## diamonds$carat:diamonds$means_cut:diamonds$means_color 8.962
## diamonds$volume:diamonds$means_cut:diamonds$means_color -10.022
## diamonds$carat:diamonds$volume:diamonds$means_clarity -5.184
## diamonds$carat:diamonds$means_cut:diamonds$means_clarity 9.420
## diamonds$volume:diamonds$means_cut:diamonds$means_clarity -10.564
## diamonds$carat:diamonds$means_color:diamonds$means_clarity 8.882
## diamonds$volume:diamonds$means_color:diamonds$means_clarity -9.738
## diamonds$means_cut:diamonds$means_color:diamonds$means_clarity -4.297
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color 6.180
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_clarity 6.602
## diamonds$carat:diamonds$volume:diamonds$means_color:diamonds$means_clarity 4.709
## diamonds$carat:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity -9.172
## diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity 10.205
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity -6.014
## Pr(>|t|)
## (Intercept) 1.10e-05
## diamonds$carat < 2e-16
## diamonds$volume < 2e-16
## diamonds$means_cut 3.78e-07
## diamonds$means_color 0.000251
## diamonds$means_clarity 1.26e-05
## diamonds$carat:diamonds$volume 4.21e-08
## diamonds$carat:diamonds$means_cut < 2e-16
## diamonds$volume:diamonds$means_cut < 2e-16
## diamonds$carat:diamonds$means_color < 2e-16
## diamonds$volume:diamonds$means_color < 2e-16
## diamonds$means_cut:diamonds$means_color 1.48e-05
## diamonds$carat:diamonds$means_clarity < 2e-16
## diamonds$volume:diamonds$means_clarity < 2e-16
## diamonds$means_cut:diamonds$means_clarity 4.09e-07
## diamonds$means_color:diamonds$means_clarity 0.000309
## diamonds$carat:diamonds$volume:diamonds$means_cut 1.25e-11
## diamonds$carat:diamonds$volume:diamonds$means_color 6.84e-07
## diamonds$carat:diamonds$means_cut:diamonds$means_color < 2e-16
## diamonds$volume:diamonds$means_cut:diamonds$means_color < 2e-16
## diamonds$carat:diamonds$volume:diamonds$means_clarity 2.18e-07
## diamonds$carat:diamonds$means_cut:diamonds$means_clarity < 2e-16
## diamonds$volume:diamonds$means_cut:diamonds$means_clarity < 2e-16
## diamonds$carat:diamonds$means_color:diamonds$means_clarity < 2e-16
## diamonds$volume:diamonds$means_color:diamonds$means_clarity < 2e-16
## diamonds$means_cut:diamonds$means_color:diamonds$means_clarity 1.73e-05
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color 6.44e-10
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_clarity 4.10e-11
## diamonds$carat:diamonds$volume:diamonds$means_color:diamonds$means_clarity 2.50e-06
## diamonds$carat:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity < 2e-16
## diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity < 2e-16
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity 1.82e-09
##
## (Intercept) ***
## diamonds$carat ***
## diamonds$volume ***
## diamonds$means_cut ***
## diamonds$means_color ***
## diamonds$means_clarity ***
## diamonds$carat:diamonds$volume ***
## diamonds$carat:diamonds$means_cut ***
## diamonds$volume:diamonds$means_cut ***
## diamonds$carat:diamonds$means_color ***
## diamonds$volume:diamonds$means_color ***
## diamonds$means_cut:diamonds$means_color ***
## diamonds$carat:diamonds$means_clarity ***
## diamonds$volume:diamonds$means_clarity ***
## diamonds$means_cut:diamonds$means_clarity ***
## diamonds$means_color:diamonds$means_clarity ***
## diamonds$carat:diamonds$volume:diamonds$means_cut ***
## diamonds$carat:diamonds$volume:diamonds$means_color ***
## diamonds$carat:diamonds$means_cut:diamonds$means_color ***
## diamonds$volume:diamonds$means_cut:diamonds$means_color ***
## diamonds$carat:diamonds$volume:diamonds$means_clarity ***
## diamonds$carat:diamonds$means_cut:diamonds$means_clarity ***
## diamonds$volume:diamonds$means_cut:diamonds$means_clarity ***
## diamonds$carat:diamonds$means_color:diamonds$means_clarity ***
## diamonds$volume:diamonds$means_color:diamonds$means_clarity ***
## diamonds$means_cut:diamonds$means_color:diamonds$means_clarity ***
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color ***
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_clarity ***
## diamonds$carat:diamonds$volume:diamonds$means_color:diamonds$means_clarity ***
## diamonds$carat:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity ***
## diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity ***
## diamonds$carat:diamonds$volume:diamonds$means_cut:diamonds$means_color:diamonds$means_clarity ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1036 on 48772 degrees of freedom
## Multiple R-squared: 0.9069, Adjusted R-squared: 0.9069
## F-statistic: 1.533e+04 on 31 and 48772 DF, p-value: < 2.2e-16
plot(diamonds$price ~ x, main = "Final Linear Regression Model", xlab="Carat and Volume",ylab="Price")
abline(lm(diamonds$price ~ x), col="red") # just for visual purposes
cat("EQUATION of the Line: y =", finalModel$coefficients[1], "+",
finalModel$coefficients[2], "c +",
finalModel$coefficients[3], "v +",
finalModel$coefficients[4], "u +",
finalModel$coefficients[5], "o +",
finalModel$coefficients[6], "a +",
finalModel$coefficients[7], "cv + ",
finalModel$coefficients[8], "cu + ",
finalModel$coefficients[9], "vu + ",
finalModel$coefficients[10], "co + ",
finalModel$coefficients[11], "vo + ",
finalModel$coefficients[12], "uo + ",
finalModel$coefficients[13], "ca + ",
finalModel$coefficients[14], "va + ",
finalModel$coefficients[15], "ua + ",
finalModel$coefficients[16], "oa + ",
finalModel$coefficients[17], "cvu + ",
finalModel$coefficients[18], "cvo + ",
finalModel$coefficients[19], "cuo + ",
finalModel$coefficients[20], "vuo + ",
finalModel$coefficients[21], "cva + ",
finalModel$coefficients[22], "cua + ",
finalModel$coefficients[23], "vua + ",
finalModel$coefficients[24], "coa + ",
finalModel$coefficients[25], "voa + ",
finalModel$coefficients[26], "uoa + ",
finalModel$coefficients[27], "cvuo + ",
finalModel$coefficients[28], "cvua + ",
finalModel$coefficients[29], "cvoa + ",
finalModel$coefficients[30], "cuoa + ",
finalModel$coefficients[31], "vuoa + ",
finalModel$coefficients[32], "cvuoa",
"\nwhere c = carat and v = volume and u = means_cut and o = means_color and a = means_clarity")
## EQUATION of the Line: y = 281951.9 + 7457335 c + -52117.11 v + -277805 u + -194137 o + -202416.9 a + 3442.02 cv + -6513373 cu + 46169.8 vu + -5753615 co + 39908.26 vo + 195966.9 uo + -5697296 ca + 39525.66 va + 200113.6 ua + 138181.4 oa + -3625.142 cvu + -2524.948 cvo + 5023956 cuo + -35372.2 vuo + -2404.714 cva + 4963235 cua + -34932.48 vua + 4379167 coa + -30150.44 voa + -140397.9 uoa + 2677.757 cvuo + 2609.128 cvua + 1766.833 cvoa + -3813189 cuoa + 26662.93 vuoa + -1922.923 cvuoa
## where c = carat and v = volume and u = means_cut and o = means_color and a = means_clarity
With this final linear model calculated, the final R^2 value discovered is 0.9069 which is better than the original 0.8249 when carat was plotted against price alone. This model can be used to determine the price of a diamonds based on its individual characteristics within statistical significance due to the high R^2 value.
In the future, perhaps instead of doing a simple linear regression model, I can add more polynomial regressions to account for any devistions from the line as the x value grows faster than the lines. Also a machine learning model can be trained using this data (split into test and training data), which may be more useful and accurate with a series of convolutional layers.